Streamlined quantitative analysis of histone modification abundance at nucleosome-scale resolution with siQ-ChIP version 2.0

We recently introduced an absolute and physical quantitative scale for chromatin immunoprecipitation followed by sequencing (ChIP-seq). The scale itself was determined directly from measurements routinely made on sequencing samples without additional reagents or spike-ins. We called this approach sans spike-in quantitative ChIP, or siQ-ChIP. Herein, we extend those results in several ways. First, we simplified the calculations defining the quantitative scale, reducing practitioner burden. Second, we reveal a normalization constraint implied by the quantitative scale and introduce a new scheme for generating ‘tracks’. The constraint requires that tracks are probability distributions so that quantified ChIP-seq is analogous to a mass distribution. Third, we introduce some whole-genome analyses that allow us, for example, to project the IP mass (immunoprecipitated mass) onto the genome to evaluate how much of any genomic interval was captured in the IP. We applied siQ-ChIP to p300/CBP inhibition and compare our results to those of others. We detail how the same data-level observations are misinterpreted in the literature when tracks are not understood as probability densities and are compared without correct quantitative scaling, and we offer new interpretations of p300/CBP inhibition outcomes.

The siQ-ChIP scale α can be obtained as a units conversion applied to the IP reaction efficiency as follows. The heart of siQ-ChIP is the realization that the IP is subject to the basic mass conservation laws that govern all reversible binding reactions. Namely, the total antibody concentration is equal to the sum of the free antibody and bound antibody concentrations. Because of this, the IP mass must follow a sigmoidal form, where increasing antibody concentration causes increased IP mass up until the reaction is saturated. As we explain next, the work of siQ-ChIP is concerned with two features: the determination of the isotherm and the units conversion that maps IP mass to concentration of antibody-chromatin complex. The concentration of complex is what sets the quantitative scale for siQ-ChIP.
In more formal terms, the sum of free antibody ($AB^f$) and bound antibody takes the following form

$$AB^t = AB^f + \sum_i K_{B,i}\, AB^f S^f_i$$

where we used the traditional binding constant definition $K_{B,i} = [AB \cdot S_i]/(AB^f S^f_i)$. $S_i$ is the $i$-th species or epitope that interacts with the antibody and $[AB \cdot S_i]$ is the concentration of complex. Each species is also subject to a conservation of mass constraint, $S^t_i = S^f_i + AB^f K_{B,i} S^f_i$, where $S^t_i$ is the total concentration of species $i$ and $S^f_i$ is the free (or unbound) concentration of species $i$.
The symbol $S^t_i$ represents the concentration of a chromatin 'state'. Without trying to enumerate all possibilities, these could include all mono-nucleosome fragments that present a defined set of histone modifications. There may be another species $S^t_j$ for the di-nucleosome fragments that present the same modifications, yet another term, $S^t_k$, for mono-nucleosomes presenting different modifications or combinations of modifications, and so on.
Of interest here is the solution to these mass conservation laws. The solution is just the set of values $S^f_i$ and $AB^f$ that would simultaneously satisfy all of the conservation equations. If we knew the binding constants $K_{B,i}$ then we could generate the solution numerically. Of course, we do not know the binding constants and we also don't know how to enumerate all of the terms in the conservation laws, but we have a very handy way to make these shortcomings moot: We determine the actual IP mass empirically, which is itself the sum of all the bound fragments whatever they are and however they came to be there. We can empirically determine this correct mass without needing to know all the terms and constants exactly.
Formally, we have the total bound concentration $S^b = \sum_i [AB \cdot S_i]$, which for our model can be expressed as

$$S^b = \sum_i \left(S^t_i - S^f_i\right) = \sum_i \frac{S^t_i K_{B,i} AB^f}{1 + K_{B,i} AB^f}$$

where we used the bound concentration $S^t_i - S^f_i$. The total $S^b$ is a sum of sigmoids; thus, as described above, we anticipate that $S^b$ will plateau or saturate when $AB^t$ is increased.
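The numerical solution mentioned above can be illustrated with a minimal sketch, assuming the binding constants and total species concentrations were known. The function names, the bisection solver, and every numerical value below are hypothetical choices made for illustration. They are not part of siQ-ChIP, which determines the IP mass empirically instead.

```python
def bound_sum(ab_free, k_bind, s_total):
    """Sum over species of S_t_i * K_i * AB_f / (1 + K_i * AB_f)."""
    return sum(k * ab_free * s / (1.0 + k * ab_free)
               for k, s in zip(k_bind, s_total))


def free_antibody(ab_total, k_bind, s_total, tol=1e-12):
    """Solve AB_t = AB_f + bound_sum(AB_f) for AB_f by bisection.

    The left-hand side minus AB_t increases monotonically with AB_f,
    is negative at AB_f = 0 and non-negative at AB_f = AB_t,
    so the root is bracketed by [0, AB_t].
    """
    lo, hi = 0.0, ab_total
    while hi - lo > tol * max(ab_total, 1.0):
        mid = 0.5 * (lo + hi)
        if mid + bound_sum(mid, k_bind, s_total) - ab_total > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def bound_concentration(ab_total, k_bind, s_total):
    """Total bound concentration S_b at a given total antibody concentration."""
    ab_free = free_antibody(ab_total, k_bind, s_total)
    return bound_sum(ab_free, k_bind, s_total)


# Hypothetical binding constants and total concentrations (arbitrary, consistent units).
K_BIND = [10.0, 1.0, 0.1]
S_TOTAL = [1.0, 2.0, 5.0]

# Isotherm: S_b rises with AB_t and saturates below sum(S_TOTAL).
isotherm = [(ab, bound_concentration(ab, K_BIND, S_TOTAL))
            for ab in (0.1, 1.0, 10.0, 100.0, 1000.0)]
for ab_total, s_bound in isotherm:
    print(f"AB_t = {ab_total:8.1f}   S_b = {s_bound:.4f}")
```

Bisection is used here only because the residual is monotonic in $AB^f$, which makes the root easy to bracket. Any one-dimensional root finder would serve equally well.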
The key for siQ-ChIP is that this concentration $S^b$ can be converted to mass using the average molecular weight per base pair (660 g/mol/bp) and the average fragment length $L$, yielding

$$m_{IP} = S^b\,(V - v_{in})\,660L$$

which is the IP mass. The factor $(V - v_{in})660L$ converts from concentration units to mass units. Now, once $m_{IP}$ is determined empirically, $S^b$ can be estimated using this unit conversion. We can do a similar thing with the determined input mass, $m_{input} = v_{in}\,660L_{in}\,S^t$, where $S^t$ is the total chromatin concentration. The quantitative scale put forward by siQ-ChIP is based on the fact that the total IP capture efficiency can be expressed as

$$\frac{S^b}{S^t} = \frac{m_{IP}}{m_{input}}\,\frac{v_{in}}{V - v_{in}}\,\frac{L_{in}}{L}.$$

Because some of the IP and input masses will be sequenced, we have knowledge of the genomic coordinates for a representative collection of the chromatin fragments. Using $x$ to denote genomic coordinates and $f'(x)$ to denote any proper summary of the sequenced fragments (e.g., We say proper here because $|f'| = R$ implies that no fragment can be counted more than once. This places a strict constraint on how sequencing tracks are built and interpreted. Typical practice will overcount sequenced fragments, with each fragment counted once for each base pair in the fragment. The key result here is that , which is an estimate of the concentration of bound fragments at $x$.

Fig. 1: The impacts of using expected input, or 'fake input', to regularize siQ-scaling.
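The mass-to-concentration conversions described in this section can be sketched in a few lines. This is a minimal illustration, assuming masses in grams and volumes in liters so that concentrations come out in mol/L. The function names and all example values are hypothetical, and $V$ and $v_{in}$ are interpreted here as the total reaction volume and the volume set aside as input.

```python
BP_MOLAR_MASS = 660.0  # average molecular weight per base pair, g/mol/bp (from the text)


def bound_concentration_from_mass(m_ip, volume, v_in, frag_len):
    """Invert m_IP = S_b * (V - v_in) * 660 * L to estimate S_b."""
    return m_ip / ((volume - v_in) * BP_MOLAR_MASS * frag_len)


def total_concentration_from_mass(m_input, v_in, frag_len_in):
    """Invert m_input = v_in * 660 * L_in * S_t to estimate S_t."""
    return m_input / (v_in * BP_MOLAR_MASS * frag_len_in)


def capture_efficiency(m_ip, m_input, volume, v_in, frag_len, frag_len_in):
    """IP capture efficiency S_b / S_t from the measured masses."""
    s_bound = bound_concentration_from_mass(m_ip, volume, v_in, frag_len)
    s_total = total_concentration_from_mass(m_input, v_in, frag_len_in)
    return s_bound / s_total


# Hypothetical measurements: 10 ng IP, 50 ng input, 200 uL total, 50 uL input,
# 400 bp average fragment length in both IP and input.
efficiency = capture_efficiency(
    m_ip=10e-9, m_input=50e-9,
    volume=200e-6, v_in=50e-6,
    frag_len=400.0, frag_len_in=400.0,
)
print(f"capture efficiency S_b/S_t = {efficiency:.4f}")
```

The factor 660 cancels in the ratio, so the efficiency depends only on the mass ratio, the volume ratio, and the fragment-length ratio.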

PROCESSING SEQUENCING DATA
The siQ-ChIP scale is built on the IP-to-input ratio because it expresses efficiency of capture. Ultimately, this leaves us to evaluate $\alpha f_{IP}(x)/f_{in}(x)$ and to deal with the inevitable case that $f_{in}(x) \sim 0$ while $f_{IP}(x) > 0$. In these cases the IP demonstrates that the genomic region represented by $x$ was present in the chromatin but, for statistical reasons, is not represented in the input sequence data.
Because the sequenced input fragments are expected to be binomially distributed along the genome, we estimate the average expected depth, $\langle d \rangle$, of input at any position $x$ as $\langle d \rangle = R_{in}\,p/(1 - p)$, where $p$ is the probability of hitting any base pair in the genome. In our case we use bins larger than a single base pair and $p$ is adjusted to this width ($p = 30/3200000000$ for bins of 30 base pairs and a total of 3200000000 bases). Any time $f_{in}(x) < \langle d \rangle$ we replace the input with $\langle d \rangle$. We refer to this replacement as 'fake input', and an example of how this impacts data is shown in SI-Fig. 1. The siQ-scale should not be larger than unity for any reason other than noise in the determination of $\alpha$. SI-Fig. 1 shows how using the 'fake input' resolves the over-unity problem where it results from sampling errors in the input track. Over-unity peaks are still possible, but are less likely.
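The 'fake input' replacement can be sketched as follows. This is a minimal illustration, assuming binned tracks stored as plain lists. The function names, the input depth of 40 million fragments, the value of $\alpha$, and the toy tracks are all hypothetical and are not taken from the siQ-ChIP scripts.

```python
def expected_input_depth(n_input_fragments, bin_width, genome_size):
    """Average expected input depth per bin, <d> = R_in * p / (1 - p)."""
    p = bin_width / genome_size
    return n_input_fragments * p / (1.0 - p)


def regularize_input(input_track, depth_floor):
    """Replace any input value below <d> with <d> (the 'fake input')."""
    return [max(value, depth_floor) for value in input_track]


def scaled_track(alpha, ip_track, input_track):
    """Evaluate alpha * f_IP(x) / f_in(x) bin by bin."""
    return [alpha * ip / inp for ip, inp in zip(ip_track, input_track)]


# Hypothetical example: 40 million input fragments, 30 bp bins, 3.2 Gb genome.
depth_floor = expected_input_depth(40_000_000, 30, 3_200_000_000)

ip_track = [0.0, 3.0, 12.0, 4.0]      # toy IP counts per bin
input_track = [0.0, 0.0, 6.0, 8.0]    # toy input counts per bin; two bins are empty

regularized = regularize_input(input_track, depth_floor)
track = scaled_track(0.05, ip_track, regularized)  # alpha = 0.05 is a placeholder
print(depth_floor, regularized, track)
```

Without the floor, the second bin would require division by zero even though the IP shows that fragments from that region were present.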
The siQ-ChIP sequencing track is given by $s(x) = \alpha f_{IP}(x)/f_{in}(x)$. Any genomic interval $X$ that has signal satisfying $s(x) > \langle s \rangle$ for all $x \in X$ and $s(x) > \langle s \rangle + 3\sigma$ for some $x \in X$ is understood as displaying a peak. This is a simple choice for selecting intervals that have signal larger than apparent background. We did not experiment with values other than $3\sigma$, but one can set this value in the siQ-ChIP scripts.
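The peak criterion can be sketched as a single pass over a binned track. This is a minimal illustration, not the siQ-ChIP implementation. It assumes that the background level is the mean of the track and that $\sigma$ is its population standard deviation, neither of which is specified in this section. The function name and the toy track are hypothetical.

```python
from statistics import mean, pstdev


def call_peaks(track, n_sigma=3.0):
    """Return half-open bin intervals (start, end) that qualify as peaks.

    An interval is a maximal run of bins whose signal exceeds the track mean,
    kept only if at least one bin exceeds mean + n_sigma * sigma.
    Assumption: mean and sigma are computed over the whole track.
    """
    baseline = mean(track)
    threshold = baseline + n_sigma * pstdev(track)
    peaks = []
    start = None
    # A sentinel equal to the baseline closes a run that reaches the last bin.
    for i, value in enumerate(list(track) + [baseline]):
        if value > baseline:
            if start is None:
                start = i
        elif start is not None:
            if max(track[start:i]) > threshold:
                peaks.append((start, i))
            start = None
    return peaks


# Toy track: flat background, one strong peak, one weak bump below threshold.
toy_track = [1.0] * 50 + [2.0, 8.0, 3.0] + [1.0] * 50 + [1.3] + [1.0] * 20
peaks = call_peaks(toy_track)
print(peaks)
```

The weak bump rises above the mean but never above the $3\sigma$ threshold, so only the strong interval is reported.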
As noted in the Main text, part of the database of peaks includes the Fréchet distance between control and experimental data. The Fréchet distance is a metric of shape-similarity between a peak in the control track and the experimental track. This similarity is computed for each interval $X$, where the tracks on the interval are mapped to the unit square. We map to the unit square so that there is no unit-based disparity between the x- and y-coordinates of the tracks and so that the notion of shape is independent of the height of the peaks. To appreciate the quantitative shape comparison, one can imagine the unit square as a visual display, like a projector screen, comprised of pixels. If the control and experimental tracks are plotted on the display, $(d_F)^{-2}$ gives us an idea of the most pixels the display can have while still allowing the two curves to look similar by eye. A large number of pixels implies a high-resolution match, corresponding to a small $d_F$ value. Conversely, low-resolution matches have large values of $d_F$.
For example, a value of $d_F = 0.2$ gives us 25 pixels while a distance of 0.4 gives us a display of roughly 6 pixels. This small displacement of 0.2 in the value of $d_F$ generates a 4-fold reduction in the effective resolution for comparing the data. As a rough guide, values smaller than 0.3 will generally be agreed upon as looking similar, whereas larger values will not. In SI-Fig. 2 we illustrate how the metric looks for several actual peak comparisons.
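The shape comparison can be illustrated with a minimal sketch that uses the discrete Fréchet distance, a standard dynamic-programming approximation that may differ from the computation in the siQ-ChIP scripts. The mapping to the unit square assumes each track is divided by its own maximum on the interval. The function names and the toy tracks are hypothetical.

```python
from math import hypot


def to_unit_square(values):
    """Map a binned track on an interval to points in the unit square.

    Assumption: x runs from 0 to 1 across the interval and y is scaled
    by the track's own maximum, so shape is independent of peak height.
    """
    n = len(values)
    top = max(values)
    return [(i / (n - 1), v / top) for i, v in enumerate(values)]


def discrete_frechet(p, q):
    """Discrete Frechet distance between two point sequences (iterative DP)."""
    n, m = len(p), len(q)
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = hypot(p[i][0] - q[j][0], p[i][1] - q[j][1])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                ca[i][j] = max(min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1]), d)
    return ca[n - 1][m - 1]


def effective_pixels(d_f):
    """Approximate pixel count (d_F)^-2 at which two curves still look alike."""
    return d_f ** -2


# Toy control and treated tracks over the same interval.
control = to_unit_square([0.0, 1.0, 4.0, 1.0, 0.0])
treated = to_unit_square([0.0, 0.0, 1.0, 4.0, 1.0])
d_f = discrete_frechet(control, treated)
print(d_f, effective_pixels(d_f))
```

Because both tracks are rescaled before comparison, a uniform loss of signal leaves the distance unchanged, and only a change in peak shape or position contributes.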
The extent to which peak shapes ought to be conserved between samples or treatments has not been quantitatively characterized. SI-Fig. 3 reports all the shape response distributions. We point out that the units and scale of the Fréchet metric take some getting used to. To help calibrate to the scale of $d_F$, SI-Fig. 2 reports on a few values of $d_F$. SI-Fig. 4 reports our global observations of histone acetylation after p300/CBP inhibition. Global losses are clearly reported for A485, while little effect can be appreciated for CBP30. SI-Fig. 5 reports on a biological repeat of the isotherms for chromatin:antibody reactions and reports all bead-only capture amounts. No bead-blocking or preclearing is used, and almost all bead-only capture masses are below 1% by mass.

* Electronic address: bradley.dickson@vai.org

SI-Fig. 2: Examples of peaks in a comparison of DMSO and A485 tracks with the K18ac antibody. The peaks are projected to a unitless rectangle and the Fréchet distance is computed. The circles have radii matching the Fréchet distance $d_F$. The interval containing a peak in the control track is marked by tics on the top of each bar graph and is expanded to the unitless rectangle where the Fréchet distance is illustrated. The region called to contain a peak runs from x=0 to x=1, in each case, as indicated.